Abstract

Subspace clustering is the problem of clustering data points into a union of low-dimensional
linear/affine subspaces. It is the mathematical abstraction of many important problems in
computer vision, image processing and machine learning. A line of recent work [?] provided
strong theoretical guarantees for sparse subspace clustering [?], the state-of-the-art algorithm
for subspace clustering, on both noiseless and noisy data sets. It was shown that under mild
conditions, with high probability no two points from different subspaces are clustered together.
Such a guarantee, however, is not sufficient for the clustering to be correct, due to the
notorious “graph connectivity problem” [15]. In this paper, we investigate the graph
connectivity problem for noisy sparse subspace clustering and show that a simple post-processing
procedure is capable of delivering consistent clustering under certain “general position” or
“restricted eigenvalue” assumptions. We also show that our condition is almost tight with
adversarial noise perturbation by constructing a counter-example. These results provide the
first exact clustering guarantee of noisy SSC for subspaces of dimension greater than 3.


1 INTRODUCTION 

The problem of subspace clustering originates from numerous applications in computer vision and
image processing, where there are either physical laws or empirical evidence ensuring that a
given set of data points forms a union of linear or affine subspaces. Such data points could be
feature trajectories of rigid moving objects captured by an affine camera [4], articulated
moving parts of a human body [?], illumination of different convex objects under the Lambertian
model [9], and so on. Subspace clustering is also used more generically in agnostic learning of
the best linear mixture structures in the data. For instance, it is used for image/video
compression [10], hybrid system identification, disease identification [?], as well as modeling
social network communities [3], studying privacy in movie recommendations [28] and inferring
router network topology [5].

There is a rich literature on algorithmic and theoretical analysis of subspace clustering [?].
Among the many algorithms, sparse subspace clustering (SSC) [?] is arguably the most
well-studied due to its elegant formulation, strong empirical performance and provable
guarantees to work under relatively weak conditions. The algorithm constructs a sparse linear
representation of each data point using the remaining data points as a dictionary. This approach
embeds the relationship of the data points into a sparse graph, and the intuition is that each
data point is likely to choose only points on the same subspace to linearly represent itself.
A clustering can then be obtained by finding connected components of the graph or, more
robustly, by using spectral clustering [4].

Assuming data lie exactly or approximately on a union of linear subspaces,^1 it is shown in [?]
that under certain separation conditions, this embedded graph will have no edges between any
two points in different subspaces. This criterion of success is referred to as the
“Self-Expressiveness Property (SEP)” [4, 24] and “Subspace Detection Property (SDP)” [19]. The
drawback is that there is no guarantee that the vertices within one cluster form a connected
component. Therefore, the solution may potentially over-segment the data points. This subtle
point was originally raised and partially addressed in [15], reaching the answer that when data
are noiseless and the intrinsic subspace dimension satisfies $d \le 3$, such over-segmentation
will not occur as long as all points within the same subspace are in general position; but when
$d \ge 4$, a counter-example was provided, showing that this weak “general position” condition
is no longer sufficient.

^1 Affine subspaces are handled by augmenting 1 to every data point.

In this paper, we revisit the graph connectivity problem for noisy sparse subspace clustering.
Inspired by the post-merging step presented in [4] for noiseless data, we propose a variant of
noisy sparse subspace clustering [25] that provably produces perfect clustering with high
probability, under certain “general position” or “restricted eigenvalue” assumptions. We also
provide a counter-example to show that our derived success conditions are almost tight under
the adversarial noise perturbation model. This is the first time a subspace clustering algorithm
is proven to give correct clustering under no statistical assumptions on data corrupted by
noise. To the best of our knowledge, this is also the first guarantee for Lasso that lower
bounds the number of discoveries, which might be of independent interest for other problems
that use Lasso as a subroutine.

1.1 Problem setup and notations 

For a vector $x$ we use $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$ to denote its p-norm. If
p is not explicitly specified then the 2-norm is used. The noiseless data matrix is denoted as
$X = (x_1, \dots, x_N) \in \mathbb{R}^{n \times N}$, where n is the ambient dimension and N
denotes the number of data points available. Each data point $x_i \in \mathbb{R}^n$ is
normalized so that it has unit two norm. We use $\mathcal{S} \subseteq \mathbb{R}^n$ to denote a
low-dimensional linear subspace in $\mathbb{R}^n$ and $S \in \mathbb{R}^{n \times d}$ for an
orthonormal basis of $\mathcal{S}$, where d is the intrinsic rank of $\mathcal{S}$. For subspace
clustering it is assumed that each data point $x_i$ lies on a union of underlying subspaces
$\bigcup_{\ell=1}^{L} \mathcal{S}^{(\ell)}$ with intrinsic dimensions $d_1, \dots, d_L < n$. We
use $z_1, \dots, z_N \in \{1, \dots, L\}$ to denote the ground truth cluster assignments of each
data point in X and $X^{(\ell)} = \{x_i \in X : z_i = \ell\}$ to denote all data points in the
$\ell$th cluster. Define $d(x, \mathcal{S}) = \inf_{y \in \mathcal{S}} \|x - y\|_2$ as the
distance between a point $x$ and a linear subspace $\mathcal{S}$. Since X is noiseless, we have
$d(x_i, \mathcal{S}^{(z_i)}) = 0$. The objective of subspace clustering is to recover
$\{\mathcal{S}^{(\ell)}\}_{\ell=1}^{L}$ and $\{z_i\}_{i=1}^{N}$ up to permutations.

Under the fully deterministic data model [19] no additional stochastic model is assumed on
either the underlying subspaces or the data points. For noisy subspace clustering we observe a
noise-perturbed matrix $Y = (y_1, \dots, y_N) \in \mathbb{R}^{n \times N}$ where
$y_i = x_i + \varepsilon_i$. The noise variables $\{\varepsilon_i\}_{i=1}^{N}$ considered
previously can be either deterministic (i.e., adversarial) or stochastic (e.g., Gaussian white
noise) [24, 20].

Table 1: The hierarchy of assumptions on the subspaces. A: independent subspaces; B: disjoint
subspaces*; C: overlapping subspaces*. Note that $A \subset B \subset C$. Superscript *
indicates that additional separation conditions are needed.

  A    $\dim[\mathcal{S}_1 \otimes \cdots \otimes \mathcal{S}_L] = \sum_{\ell=1}^{L} \dim[\mathcal{S}_\ell]$
  B    $\mathcal{S}_\ell \cap \mathcal{S}_{\ell'} = \{0\}$ for all $\{(\ell, \ell') \mid \ell \ne \ell'\}$
  C    $\dim(\mathcal{S}_\ell \cap \mathcal{S}_{\ell'}) < \min\{\dim(\mathcal{S}_\ell), \dim(\mathcal{S}_{\ell'})\}$ for all $\{(\ell, \ell') \mid \ell \ne \ell'\}$

Table 2: Reference of assumptions on data points. Columns correspond to data point generation
assumptions and rows correspond to different noise regimes.

                    1. Semi-Random                                  2. Deterministic
  a. noiseless      $\varepsilon_i = 0$                             $\varepsilon_i = 0$
  b. stochastic     $\varepsilon_i \sim \mathcal{N}(0, \sigma^2 I)$ $\varepsilon_i \sim \mathcal{N}(0, \sigma^2 I)$
  c. adversarial    $\|\varepsilon_i\|_2 \le \xi$                   $\|\varepsilon_i\|_2 \le \xi$

Given the ground-truth clustering $\{z_i\}_{i=1}^{N}$, a similarity graph
$C \in \mathbb{R}^{N \times N}$ satisfies the Self-Expressiveness Property (SEP, [4]) if
$|C_{ij}| > 0$ implies $z_i = z_j$. Note that the converse is not necessarily true. That is,
$z_i = z_j$ does not imply $|C_{ij}| > 0$.
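
As a small illustration of the SEP condition, the following sketch checks whether a given similarity matrix has any nonzero entry linking two points with different ground-truth labels. The function name `satisfies_sep`, the tolerance argument and the toy matrices are assumptions made for this example only; they do not come from the paper.

```python
import numpy as np

def satisfies_sep(C, z, tol=0.0):
    """Return True if |C_ij| > tol implies z_i == z_j (hypothetical helper)."""
    C = np.asarray(C, dtype=float)
    z = np.asarray(z)
    # Boolean mask that is True where points i and j carry different labels.
    different = z[:, None] != z[None, :]
    return not np.any(np.abs(C[different]) > tol)

# Toy example with two clusters {0, 1} and {2, 3}.
z = [0, 0, 1, 1]
C_good = np.array([[0, 1, 0, 0],
                   [1, 0, 0, 0],
                   [0, 0, 0, 0],
                   [0, 0, 0, 0]], dtype=float)
C_bad = C_good.copy()
C_bad[0, 2] = C_bad[2, 0] = 0.5   # an edge across clusters violates SEP

print(satisfies_sep(C_good, z), satisfies_sep(C_bad, z))
```

Note that `C_good` satisfies SEP even though points 2 and 3 are not connected to each other, which is exactly the gap between SEP and correct clustering that the paper studies.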

2 RELATED WORK 

The pursuit of provable subspace clustering methods has seen much progress recently. Theoretical
guarantees for several algorithms have been established in many regimes, and it is not always
clear how these results relate to one another. In this section, we first review the different
assumptions and claims in the literature and then pinpoint what our contributions are.

Table 1 lists the hierarchy of assumptions on the subspaces. Each row is weaker than the
previous row. Except for the independent subspace assumption, which on its own is sufficient,
results for more general models typically require additional conditions on the subspaces and
the data points in each subspace. For instance, the “semi-random model” assumes data points to
be drawn i.i.d. uniformly at random from the unit sphere in each subspace, and the more generic
“deterministic model” places assumptions on the radius of the smallest inscribing sphere of the
symmetric polytope spanned by data points [?] or the smallest non-zero singular value of the
data matrix [?]. Related theoretical guarantees of subspace clustering algorithms in the
literature are summarized in Table 3, where the assumptions about subspaces are denoted with
capital letters “A, B, C”; different noise settings are referred to using lowercase letters
“a, b, c” in Table 2. Results that are applicable to SSC are highlighted.

As we can see from the second column of Table 3, SEP guarantees have been studied quite
exhaustively and we now understand very well the conditions under which SEP holds.
Specifically, most of the results are now near optimal under the semi-random model: SEP holds
even when different subspaces substantially overlap, have canonical angles near 0, have
dimension linear in the ambient dimension, or when the number of subspaces to be clustered is
exponentially large [19, ?, ?]. In addition, the above results also hold robustly under a small
amount of arbitrary perturbation or a large amount of stochastic noise [24]. In particular, it
was shown in [24] that the amount of tolerable stochastic noise could even be substantially
larger than the signal in both deterministic and semi-random models.

Nevertheless, the above-mentioned results do not rule out cases when the subgraph of each
subspace is not well connected. For instance, an empty graph trivially obeys SEP. As a less
trivial example, if we connect points in each subspace in disjoint pairs, then the degree of
every node will be non-zero, yet the graph does not reveal much information for clustering. It
is not hard to construct a problem such that Lasso-SSC will output exactly this. For the
original noiseless SSC, the problem becomes trickier since the solution is more constrained. In
[15] it was shown that when the subspace dimension is no larger than 3, SSC outputs a
block-wise connected similarity graph under very mild conditions; however, the graph
connectivity is easily broken when the subspace dimension exceeds 3. Though a simple
post-processing step was remarked in [4, Footnote 6 in Section 5] to alleviate the graph
connectivity issue on noiseless data, it is unclear how to extend their method when data are
corrupted by noise.

Among other subspace clustering methods, [?] and [7] are the only two papers that provide
provable exact clustering guarantees for problems beyond independent subspaces (for which LRR
provably gives dense graphs [26]). Their results, however, rely critically on the semi-random
model assumption. For instance, [7] uses the connectivity of a random k-nearest neighbor graph
on a sphere to facilitate an argument for clustering consistency. In addition, these approaches
do not easily generalize to SSC even under the semi-random model since the solution of SSC is
considerably harder to characterize. In contrast, our results are much simpler and work
generically without any probabilistic assumptions.

Lastly, there is a long line of research on “projective clustering” in the theoretical computer
science literature [?]. Unlike subspace clustering, which posits an approximate
union-of-subspace model, projective clustering makes no assumption on the data points and is
completely agnostic. The algorithms [?] are typically based on random projection and core-set
type techniques, which are exponential in the number of subspaces and/or the subspace
dimension. On the other hand, SSC based algorithms are strongly polynomial time in all model
parameters.

Table 3: Summary of existing theoretical guarantees. (*) denotes results from this paper.

  Algorithm     Ref.   SEP                  Exact clustering
  LRR           [?]    A-2-a                A-2-a
  SSC           [?]    B-2-a                -
  SSC           [?]    C-{1,2}-a            -
  Noisy SSC     [?]    C-{1,2}-{a,b,c}      -
  Robust SSC    [?]    C-1-{a,b}            -
  LRSSC         [?]    C-{1,2}-a            A-{1,2}-a
  Thresh. SC    [?]    C-1-a                -
  Robust TSC    [?]    C-1-{a,b}            C-1-{a,b}
  Greedy SC     [?]    C-1-a                C-1-a
  SSC           (*)    C-{1,2}-{a,b,c}      C-{1,2}-{a,b,c}

3 CLUSTERING CONSISTENT SSC 

In this section, we present and analyze variants of SSC algorithms that output consistent
clustering with high probability. As a warm-up exercise, we first consider the case when data
are noiseless and formally establish success conditions for a simple post-processing procedure
remarked in [4]. We then move on to our main result in Sec. 3.2, a robustified version of
clustering consistent SSC that guarantees perfect clustering on data perturbed by a small
amount of adversarial noise. Finally, we construct a counter-example, which shows that our
success condition cannot be significantly improved under the adversarial noise model.

3.1 The noiseless case 

We first review the procedure of vanilla noiseless Sparse Subspace Clustering (SSC, [?]). The
first step is to solve the following $\ell_1$ optimization problem for each data point $x_i$ in
the input matrix X:

    \min_{c_i} \|c_i\|_1, \quad \text{s.t. } x_i = X c_i,\ c_{ii} = 0.    (3.1)

Afterwards, a similarity graph $C \in \mathbb{R}^{N \times N}$ is constructed as
$C_{ij} = |[c_i^*]_j| + |[c_j^*]_i|$, where $\{c_i^*\}_{i=1}^{N}$ are optimal solutions to
Eq. (3.1). Finally, spectral clustering algorithms (e.g., [?]) are applied on the similarity
graph C to cluster the N data points into L clusters as desired. Much work has shown that the
similarity graph C satisfies SEP under various data and noise regimes [?]. However, as we
remarked earlier, SEP alone does not guarantee perfect clustering because the obtained
similarity graph C could be poorly connected [15]. In fact, little is known provably about the
final clustering result, despite the practical success of SSC.

Algorithm 1 Clustering consistent noiseless SSC
1: Input: the noiseless data matrix X.
2: Initialization: Normalize each column of X so that it has unit two norm.
3: Sparse subspace clustering: Solve the optimization problem in Eq. (3.1) for each data point
   and obtain the similarity matrix $C \in \mathbb{R}^{N \times N}$. Define an undirected graph
   $G = (V, E)$ with N nodes and $(i, j) \in E$ if and only if $C_{ij} > 0$.
4: Subspace recovery: For each connected component $G_r = (V_r, E_r) \subseteq G$, compute
   $\hat{\mathcal{S}}_{(r)} = \mathrm{Range}(X_{V_r})$ using any convenient linear algebraic
   method. Let $\{\hat{\mathcal{S}}^{(\ell)}\}_{\ell=1}^{L}$ be the L unique subspaces in
   $\{\hat{\mathcal{S}}_{(r)}\}_r$.
5: Final clustering: For each connected component $V_r$ with
   $\hat{\mathcal{S}}_{(r)} = \hat{\mathcal{S}}^{(\ell)}$, set $\hat{z}_i = \ell$ for all points
   in $V_r$.
6: Output: cluster assignments $\{\hat{z}_i\}_{i=1}^{N}$ and recovered subspaces
   $\{\hat{\mathcal{S}}^{(\ell)}\}_{\ell=1}^{L}$.

We now analyze a simple post-processing procedure of the SSC algorithm (pseudocode displayed in
Algorithm 1), which was briefly remarked in [4]. We formally establish that with the additional
post-processing step the algorithm achieves consistent clustering under mild “general position”
conditions. This simple observation completes previous theoretical analysis of SSC by bridging
the gap between SEP and clustering consistency.

The general position condition is formally defined in Definition 3.1, which concerns the
distribution of data points within a single subspace. Intuitively, it requires that no subspace
contains data points that are in “degenerate” positions. Similar assumptions were made for the
analysis of some algebraic subspace clustering algorithms such as GPCA [23]. The generally
positioned data assumption is very mild and is almost always satisfied in practice. For
example, it is satisfied almost surely if data points are generated i.i.d. from any continuous
underlying distribution.

Definition 3.1 (General position). Fix $\ell \in \{1, \dots, L\}$. We say $X^{(\ell)}$ is in
general position if for all $k \le d_\ell$, any subset of k data points (columns) in
$X^{(\ell)}$ are linearly independent. We say X is in general position if $X^{(\ell)}$ is in
general position for all $\ell = 1, \dots, L$.

With the self-expressiveness property and the additional assumption that the data matrix X is
in general position, Theorem 3.1 proves that both the clustering assignments
$\{\hat{z}_i\}_{i=1}^{N}$ and the recovered subspaces
$\{\hat{\mathcal{S}}^{(\ell)}\}_{\ell=1}^{L}$ produced by Algorithm 1 are consistent with the
ground truth up to permutations.

Theorem 3.1 (SSC clustering success condition). Assume X is in general position and no two
underlying subspaces are identical. Let $\{\hat{z}_i\}_{i=1}^{N}$ and
$\{\hat{\mathcal{S}}^{(\ell)}\}_{\ell=1}^{L}$ be the output of Algorithm 1. If the similarity
graph C satisfies the self-expressiveness property as defined in Sec. 1.1, then there exists a
permutation $\pi$ on $[L]$ such that $\pi(\hat{z}_i) = z_i$ and
$\hat{\mathcal{S}}^{(\ell)} = \mathcal{S}^{(\pi(\ell))}$ for all $i = 1, \dots, N$ and
$\ell = 1, \dots, L$.

The correctness of Theorem 3.1 is quite straightforward and hence we defer its complete proof
to Appendix A. We also make some comments on the general identifiability and the potential
application of $\ell_0$ optimization on union-of-subspace structured data. As these remarks are
only loosely connected to our main results, we state them in Appendix B. Finally we remark that
Algorithm 1 only works when the input data are not corrupted by noise. A non-trivial robust
extension is provided in the next section.

3.2 The noisy case

Algorithm 2 Clustering consistent noisy SSC
1: Input: noisy input matrix Y, number of subspaces L, intrinsic dimension d and tuning
   parameter $\lambda$.
2: Initialization: Normalize each column of Y so that it has unit two norm.
3: Noisy SSC: Solve the optimization problem in Eq. (3.2) with parameter $\lambda$ for each
   data point and obtain the similarity matrix $C \in \mathbb{R}^{N \times N}$. Define an
   undirected graph $G = (V, E)$ with N nodes and $(i, j) \in E$ if and only if $C_{ij} > 0$.
4: Subspace recovery: For each connected component $G_r = (V_r, E_r) \subseteq G$ with
   $|V_r| \ge d$, randomly pick $V_{r,d} \subseteq V_r$ containing exactly d points in $V_r$
   and compute $\hat{\mathcal{S}}_{(r)} = \mathrm{Range}(Y_{V_{r,d}})$.
5: Subspace merging: Compute the angular distance
   $d(\hat{\mathcal{S}}_{(r)}, \hat{\mathcal{S}}_{(r')})$ as in Eq. (3.3) for each pair
   $(r, r')$. Merge subspaces via single linkage clustering with respect to $d(\cdot, \cdot)$,
   until there are exactly L subspaces.
6: Output: cluster assignment $\{\hat{z}_i\}_{i=1}^{N}$, with $\hat{z}_i = \hat{z}_j$ if and
   only if data points i and j are in the same merged subspace.


In this section we adopt a noisy input model $Y = X + E$ where X is the noiseless design matrix
and Y is the noisy input that is observed. The noise matrix
$E = (\varepsilon_1, \dots, \varepsilon_N)$ is assumed to be deterministic with
$\|\varepsilon_i\|_2 \le \xi$ for every $i = 1, \dots, N$ and some noise magnitude parameter
$\xi > 0$. For noisy inputs Y a Lasso formulation as in Eq. (3.2) is employed for every data
point $y_i$. Choices of the tuning parameter $\lambda$ and SEP success conditions for Eq. (3.2)
have been comprehensively characterized in [24] and [20].


    \min_{c_i} \ \tfrac{1}{2} \|y_i - Y c_i\|_2^2 + \lambda \|c_i\|_1, \quad \text{s.t. } c_{ii} = 0.    (3.2)
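
A minimal sketch of the Lasso step and the resulting similarity graph is given below. It is not the authors' implementation: it assumes the objective $\tfrac{1}{2}\|y_i - Yc_i\|_2^2 + \lambda\|c_i\|_1$ (the exact scaling of the quadratic term is an assumption), solves it with plain proximal gradient descent (ISTA), and uses the hypothetical function names `lasso_ssc_coefficients` and `similarity_graph`.

```python
import numpy as np

def soft_threshold(v, t):
    """Entry-wise soft-thresholding, the proximal operator of t * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ssc_coefficients(Y, lam, n_iter=500):
    """Column i of the result approximately minimizes
    0.5 * ||y_i - Y c||_2^2 + lam * ||c||_1 subject to c_i = 0 (ISTA sketch)."""
    n, N = Y.shape
    C = np.zeros((N, N))
    for i in range(N):
        D = Y.copy()
        D[:, i] = 0.0  # enforce c_ii = 0 by removing y_i from the dictionary
        step = 1.0 / max(np.linalg.norm(D, 2) ** 2, 1e-12)  # 1 / Lipschitz constant
        c = np.zeros(N)
        for _ in range(n_iter):
            grad = D.T @ (D @ c - Y[:, i])
            c = soft_threshold(c - step * grad, step * lam)
        c[i] = 0.0
        C[:, i] = c
    return C

def similarity_graph(C):
    """Symmetrize: W_ij = |[c_i]_j| + |[c_j]_i|."""
    A = np.abs(C)
    return A + A.T

# Toy data: two orthogonal 2-D subspaces of R^4 with four unit-norm points each.
angles = np.array([0.1, 0.9, 1.7, 2.5])
P = np.vstack([np.cos(angles), np.sin(angles)])
Y = np.zeros((4, 8))
Y[:2, :4] = P   # cluster 1 lives in span(e1, e2)
Y[2:, 4:] = P   # cluster 2 lives in span(e3, e4)
W = similarity_graph(lasso_ssc_coefficients(Y, lam=0.05))
```

On this toy input the two off-diagonal blocks of `W` vanish, so SEP holds; whether each diagonal block is connected is a separate question, which is the point of the post-processing step.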


We first propose a variant of the noisy subspace clustering algorithm (pseudocode listed in
Algorithm 2) that resembles Algorithm 1 for the noiseless setting. For simplicity we assume all
underlying subspaces share the same intrinsic dimension d, which is known a priori. The key
difference between Algorithms 1 and 2 is that we can no longer unambiguously identify L unique
subspaces due to the data noise. Instead, we employ a single linkage clustering procedure that
merges the estimated subspaces that are close with respect to the “angular distance” measure
between two subspaces, which is defined as

    d(\mathcal{S}, \mathcal{S}') := \|\sin \Theta(\mathcal{S}, \mathcal{S}')\|_F
        = \sqrt{\sum_{i=1}^{d} \sin^2 \phi_i(\mathcal{S}, \mathcal{S}')},    (3.3)

where $\{\phi_i(\mathcal{S}, \mathcal{S}')\}_{i=1}^{d}$ are the canonical angles between two
d-dimensional subspaces $\mathcal{S}$ and $\mathcal{S}'$. The angular distance is closely
related to the concept of subspace affinity defined in [?]. In fact, one can show that
$d(\mathcal{S}, \mathcal{S}') = \sqrt{d - \mathrm{aff}(\mathcal{S}, \mathcal{S}')^2}$ when both
$\mathcal{S}$ and $\mathcal{S}'$ are d-dimensional subspaces.
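
The angular distance can be computed from the singular values of the product of two orthonormal bases, since those singular values are the cosines of the canonical angles. The sketch below assumes the distance is $\|\sin\Theta\|_F$ and that the affinity is the Frobenius norm of the cosines; the function names are chosen for this example only.

```python
import numpy as np

def orthonormal_basis(A):
    """Orthonormal basis of range(A) for an n x d matrix with full column rank."""
    Q, _ = np.linalg.qr(A)
    return Q

def angular_distance(A, B):
    """||sin Theta||_F between range(A) and range(B)."""
    Qa, Qb = orthonormal_basis(A), orthonormal_basis(B)
    # Singular values of Qa^T Qb are the cosines of the canonical angles.
    cosines = np.clip(np.linalg.svd(Qa.T @ Qb, compute_uv=False), 0.0, 1.0)
    return np.sqrt(np.sum(1.0 - cosines ** 2))

def affinity(A, B):
    """sqrt(sum_i cos^2 phi_i), computed as the Frobenius norm of Qa^T Qb."""
    Qa, Qb = orthonormal_basis(A), orthonormal_basis(B)
    return np.linalg.norm(Qa.T @ Qb, "fro")

I4 = np.eye(4)
S1 = I4[:, :2]                       # span(e1, e2)
S2 = I4[:, 2:]                       # span(e3, e4), orthogonal to S1
S3 = np.array([[1.0, 0.0],
               [0.0, 1.0],
               [0.0, 1.0],
               [0.0, 0.0]])          # shares the direction e1 with S1

print(angular_distance(S1, S2), angular_distance(S1, S3))
```

The identity relating distance and affinity can be checked numerically on `S1` and `S3`: the squared distance plus the squared affinity equals the dimension d = 2.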

In the remainder of this section we present a theorem that proves clustering consistency of
Algorithm 2. Our key assumption is a restricted eigenvalue assumption, which imposes a lower
bound on the smallest singular value of any subset of d data points within an underlying
subspace.

Assumption 3.1 (Restricted eigenvalue assumption). Assume there exist constants
$\{\sigma_\ell\}_{\ell=1}^{L}$ such that for every $\ell = 1, \dots, L$ the following holds:

    \min_{X_d = (x_1, \dots, x_d) \subseteq X^{(\ell)}} \sigma_d(X_d) \ge \sigma_\ell > 0,    (3.4)

where $X_d$ is taken over all subsets of d data points in the $\ell$th subspace and
$\sigma_d(\cdot)$ denotes the dth singular value of an $n \times d$ matrix.
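
For small clusters the constant in Assumption 3.1 can be evaluated by brute force, which also makes clear why it is a demanding quantity: it is a minimum over all subsets of d points. The sketch below is only illustrative (the helper name `restricted_singular_value` is hypothetical) and its cost grows combinatorially with the number of points.

```python
import itertools
import numpy as np

def restricted_singular_value(X_l, d):
    """Smallest d-th singular value over all d-column subsets of X_l (n x N_l, n >= d)."""
    N_l = X_l.shape[1]
    best = np.inf
    for idx in itertools.combinations(range(N_l), d):
        s = np.linalg.svd(X_l[:, list(idx)], compute_uv=False)
        best = min(best, s[d - 1])   # singular values are returned in descending order
    return best

# Three unit vectors in the plane: e1, e2 and the diagonal direction.
X_l = np.array([[1.0, 0.0, 1.0 / np.sqrt(2.0)],
                [0.0, 1.0, 1.0 / np.sqrt(2.0)]])
sigma_l = restricted_singular_value(X_l, d=2)

# A degenerate configuration: two collinear points violate general position.
X_deg = np.array([[1.0, -1.0, 0.0],
                  [0.0, 0.0, 1.0]])
sigma_deg = restricted_singular_value(X_deg, d=2)
```

Here `sigma_l` is strictly positive because every pair of points is linearly independent, while `sigma_deg` is zero because the first two points are collinear.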

Note that Assumption 3.1 can be thought of as a robustified version of the “general position”
assumption in the noiseless case. It requires X to be not only in general position, but also in
general position with a spectral margin that is at least $\sigma_\ell$. In [4] a slightly
weaker version of the presented assumption was adopted for the analysis of sparse subspace
clustering. We remark further on the related work on restricted eigenvalue assumptions at the
end of this section.


We continue by introducing the concept of inradius, which characterizes the distribution of
data points within each subspace and was previously proposed to analyze the SEP success
conditions of sparse subspace clustering [?].

Definition 3.2 (Inradius, [19, ?]). Fix $\ell \in \{1, \dots, L\}$. Let $r(Q)$ denote the
radius of the largest ball inscribed in a convex body Q. The inradius $\rho_\ell$ is defined as

    \rho_\ell = \min_{1 \le i \le N_\ell} \rho_\ell^{-i}
              = \min_{1 \le i \le N_\ell} r\big(\mathrm{conv}(\pm x_1^{(\ell)}, \dots, \pm x_{i-1}^{(\ell)},
                \pm x_{i+1}^{(\ell)}, \dots, \pm x_{N_\ell}^{(\ell)})\big),    (3.5)

where $N_\ell$ is the number of data points in the $\ell$th cluster and
$\mathrm{conv}(\cdot)$ denotes the convex hull of a given point set.


Note that the inradius $\rho_\ell$ is strictly between 0 and 1. The larger $\rho_\ell$ is, the
more uniformly data points are distributed in the $\ell$th cluster. With the restricted
eigenvalue assumption and the definition of inradius, we are now ready to present the main
theorem of this section, which shows that Algorithm 2 returns a consistent clustering when some
conditions on the design matrix, the noise level and the range of parameters are met.

Theorem 3.2. Assume Assumption 3.1 holds and furthermore, for all
$\ell, \ell' \in \{1, \dots, L\}$, $\ell \ne \ell'$, the following holds:




8dC 

> • 2 ; 

(3.6) 



• (3.7) 


Assume also that the self-expressiveness property holds for the similarity matrix C constructed
by Algorithm 2. If the algorithm parameter $\lambda$ satisfies

2e(i+e'(i + i/p^)<A<^ (3.8) 


for every $\ell \in \{1, \dots, L\}$, then the clustering $\{\hat{z}_i\}_{i=1}^{N}$ output by
Algorithm 2 is consistent with the ground-truth clustering $\{z_i\}_{i=1}^{N}$; that is, there
exists a permutation $\pi$ on $\{1, \dots, L\}$ such that $\pi(\hat{z}_i) = z_i$ for every
$i = 1, \dots, N$.


A complete proof of Theorem 3.2 is given in Section C. Below we make several remarks to
highlight the nature and consequences of the theorem.


Remark 1 Let $(\lambda_{\min}, \lambda_{\max})$ be the feasible range of $\lambda$ as shown in
Eq. (3.8) in Theorem 3.2. It can be shown that $\lim_{\xi \to 0} \lambda_{\min} = 0$ and
$\lim_{\xi \to 0} \lambda_{\max} = \min_\ell \rho_\ell \sigma_\ell / 2 > 0$ as long as
$\sigma_\ell > 0$ for all $\ell \in \{1, \dots, L\}$; that is, X is in general position.
Therefore, the success condition in Theorem 3.2 reduces to the one in Theorem 3.1 on noiseless
data when the noise diminishes.

Remark 2 In [24] another range $(\lambda'_{\min}, \lambda'_{\max})$ on $\lambda$ is given for
success conditions of the self-expressiveness property. One can show that
$\lim_{\xi \to 0} \lambda'_{\min} = 0$ and $\lim_{\xi \to 0} \lambda'_{\max} = \rho_\ell > 0$.
Therefore, the feasible range of $\lambda$ for both SEP and Theorem 3.2 to hold is nonempty, at
least for a sufficiently low noise level $\xi$. In addition, the limiting values of
$\lambda_{\max}$ and $\lambda'_{\max}$ differ by a factor of $\sigma_\ell / 2$ and the maximum
tolerable signal-to-noise ratio on $\xi$ differs too by a similar factor of $O(\sigma_\ell)$,
which suggests the difficulty of consistent clustering as opposed to merely SEP for noisy
sparse subspace clustering. In fact, in Sec. 3.3 we construct a counter-example showing that
this dependency on $\sigma_\ell$ cannot be improved under the adversarial noise model.


Remark 3 Some components of Algorithm 2 can be revised to make the method more robust in
practical applications. For example, instead of randomly picking d points and computing their
range, one could apply robust PCA on all points in the connected component, which is more
robust to potential outliers. In addition, the single linkage clustering step could be replaced
by k-means clustering, which is more robust to false connections in practice.
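
For reference, the baseline post-processing that Remark 3 proposes to robustify (subspace recovery from d random points per connected component, followed by single linkage merging under the angular distance) can be sketched as follows. This is an illustrative reading of the two steps, not the authors' code; the function names, the fixed random seed, and the choice to label points in components with fewer than d points as -1 are assumptions of this sketch.

```python
import numpy as np

def connected_components(W, tol=1e-10):
    """Label the connected components of the graph with edges where W_ij > tol."""
    N = W.shape[0]
    labels = -np.ones(N, dtype=int)
    current = 0
    for start in range(N):
        if labels[start] >= 0:
            continue
        labels[start] = current
        stack = [start]
        while stack:                      # depth-first search
            u = stack.pop()
            for v in np.flatnonzero(W[u] > tol):
                if labels[v] < 0:
                    labels[v] = current
                    stack.append(v)
        current += 1
    return labels

def angular_distance(Qa, Qb):
    """||sin Theta||_F for two matrices with orthonormal columns."""
    cosines = np.clip(np.linalg.svd(Qa.T @ Qb, compute_uv=False), 0.0, 1.0)
    return np.sqrt(np.sum(1.0 - cosines ** 2))

def merge_components(Y, W, d, L, seed=0):
    rng = np.random.default_rng(seed)
    comp = connected_components(W)
    ids = [r for r in np.unique(comp) if np.sum(comp == r) >= d]
    bases = []
    for r in ids:
        members = np.flatnonzero(comp == r)
        picked = rng.choice(members, size=d, replace=False)   # d random points
        Q, _ = np.linalg.qr(Y[:, picked])                     # basis of their range
        bases.append(Q)
    groups = [[k] for k in range(len(ids))]
    while len(groups) > L:                # single linkage: merge the closest pair
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                dist = min(angular_distance(bases[p], bases[q])
                           for p in groups[a] for q in groups[b])
                if best is None or dist < best[0]:
                    best = (dist, a, b)
        _, a, b = best
        groups[a] += groups[b]
        del groups[b]
    z = -np.ones(Y.shape[1], dtype=int)   # -1 marks points in components smaller than d
    for label, group in enumerate(groups):
        for k in group:
            z[comp == ids[k]] = label
    return z

# Toy input: two orthogonal planes in R^4, each split into two disconnected pairs.
angles = np.array([0.1, 0.9, 1.7, 2.5])
P = np.vstack([np.cos(angles), np.sin(angles)])
Y = np.zeros((4, 8))
Y[:2, :4] = P
Y[2:, 4:] = P
W = np.zeros((8, 8))
for i, j in [(0, 1), (2, 3), (4, 5), (6, 7)]:
    W[i, j] = W[j, i] = 1.0
z = merge_components(Y, W, d=2, L=2)
```

On this noiseless toy input the four components are merged back into the two planes. Replacing the random pick by robust PCA, or the merge loop by k-means, would change only the body of `merge_components`.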


Remark 4 There has been extensive study of restricted eigenvalue assumptions in the analysis of
Lasso-type problems [?]. However, in our problem the assumption is used in a very different
manner. In particular, we use the restricted eigenvalue assumption to prove one key lemma
(Lemma C.2) that lower bounds the support size of the optimal solution to a Lasso problem. Such
results might be of independent interest as a contribution to the analysis of Lasso in general.

3.3 Discussion on Assumption 3.1

Assumption 3.1 requires a spectral gap for every subset of data points in each subspace. This
seems a very strong assumption that restricts the maximum tolerable noise magnitude to be very
small. In this section, we show that this dependency on $\sigma_\ell$ is actually necessary for
noisy SSC in the adversarial noise setting, which suggests that our bound in Theorem 3.2 is
sharp.

Proposition 3.1. There is a subspace clustering problem $X \in \mathbb{R}^{n \times N}$ and a
noise configuration $E \in \mathbb{R}^{n \times N}$ obeying adversarial noise level
$\xi := \|E\|_{2,\infty} \le \sigma_\ell$ for some subspace $\ell$ and intrinsic dimension d,
such that noiseless SSC is clustering consistent on X, but noisy SSC on $Y = X + E$ cannot
perform better than random guessing.

Proof. It suffices to come up with one such example. For the sake of simplicity we take
intrinsic dimension d = 2 with L = 4 clusters.^2 Consider a 2-dimensional subspace
$\mathcal{S}_1$ in $\mathbb{R}^n$ with orthogonal basis $U_1 \in \mathbb{R}^{n \times 2}$ and
assume there are 4 data points on the subspace represented by

    X^{(1)} = U_1 Z = U_1 \begin{pmatrix} 1 & -1 & \epsilon & \epsilon \\ \epsilon & \epsilon & 1 & -1 \end{pmatrix}.

The minimum singular value for the first two points is $\sigma_\ell = \sqrt{2}\,\epsilon$. This
is also the minimum singular value of any pair of the given points in the subspace. By taking
$\xi = \epsilon = \sigma_\ell / \sqrt{2}$, we can contaminate the data with E to obtain the
observation data matrix Y as

    Y^{(1)} = U_1 Z + E = U_1 \begin{pmatrix} 1 & -1 & 0 & 0 \\ 0 & 0 & 1 & -1 \end{pmatrix}.

Assume there is another subspace $\mathcal{S}_2 \perp \mathcal{S}_1$ with the four data points
$X^{(2)} = U_2 Z$, and we contaminate them in the same fashion into $Y^{(2)}$. Noiseless SSC on
X is trivially clustering consistent by Theorem 3.1. Noisy SSC on Y, however, will construct a
graph that has exactly 4 connected components with any $\lambda$ that returns a non-zero
solution. These are:

    {1,2}, {3,4}, {5,6}, {7,8}

Spectral clustering algorithms that try to partition the graph into 2 parts will not be able to
work better than random labeling. Similarly, Algorithm 2 will also fail because the subspaces
spanned by the noisy data points in each connected component are mutually orthogonal, and no
“merging” procedure will be able to consistently recover the original subspace assignments. □

^2 The construction of this counter-example can be easily extended to general d cases, as we
remark later.

The high level idea of this example is that $\sigma_\ell$ measures how close the data points in
subspace $\ell$ are to violating the general position assumption, and therefore with an
arbitrary perturbation of magnitude $\sigma_\ell$, we can change at least d points to lie in a
$(d-1)$-dimensional subspace, which renders the original problem non-identifiable.

Remark 5 For any intrinsic dimension $d \ge 2$, we can construct a set of d points in general
position where one only needs to perturb each data point by $\sigma_\ell / \sqrt{d}$ to make
them lie in a $(d-1)$-dimensional subspace. Fix any orthonormal basis of the subspace (without
loss of generality we work under the standard basis $[e_1, \dots, e_d]$). The d points are
linear combinations of these basis vectors with coefficients

    \begin{pmatrix} \beta_1 & \beta_2 & \cdots & \beta_d \\
    \sigma_\ell/\sqrt{d} & \sigma_\ell/\sqrt{d} & \cdots & \sigma_\ell/\sqrt{d} \end{pmatrix},
where we set $\{\beta_i\}$ to be the d vertices of a symmetric simplex in $\mathbb{R}^{d-1}$
with centroid at the origin. Just to give a few examples, in $\mathbb{R}$ this is $\{-1, 1\}$
and in $\mathbb{R}^2$ this is

    \left\{ \begin{pmatrix} 1 \\ 0 \end{pmatrix},
            \begin{pmatrix} -0.5 \\ \sqrt{3}/2 \end{pmatrix},
            \begin{pmatrix} -0.5 \\ -\sqrt{3}/2 \end{pmatrix} \right\}.

The construction of such examples is illustrated in Figure 1. In general, since all these
vectors are orthogonal to $e_d$ and the way they are constructed ensures that the top $d - 1$
singular values are all identically $\sqrt{d/(d-1)}$, the minimum singular value will be exactly
$\sigma_\ell$, and by an adversarial perturbation of size $\sigma_\ell / \sqrt{d}$ on each data
point we reduce all points to a $(d-1)$-dimensional subspace and hence they are no longer in
general position.

Figure 1: An illustration of counter-examples constructed in Proposition 3.1. Left: a 2D
example. Right: a 3D example. The arrows in blue represent the noiseless data in general
position. The arrows in red illustrate how a small perturbation of size
$\sigma_\ell / \sqrt{d}$ can potentially break the general position assumption.
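
The construction in Remark 5 can be checked numerically. The sketch below builds d unit-norm simplex vertices with centroid at the origin, appends a row with entries $\sigma/\sqrt{d}$, and verifies the claimed singular values. The way the simplex is generated (centering the standard basis and expressing it in an orthonormal basis of the hyperplane orthogonal to the all-ones vector) and the function name are choices made for this example, not taken from the paper.

```python
import numpy as np

def simplex_counterexample(d, sigma):
    """d x d coefficient matrix: simplex vertices on top, sigma / sqrt(d) in the last row."""
    ones = np.ones((d, 1))
    P = np.eye(d) - ones @ ones.T / d        # projector onto the hyperplane orthogonal to 1
    V = P / np.sqrt((d - 1) / d)             # columns: unit-norm vertices with centroid 0
    _, eigvecs = np.linalg.eigh(P)           # eigenvalues in ascending order: 0, 1, ..., 1
    Q = eigvecs[:, 1:]                       # orthonormal basis of the hyperplane
    beta = Q.T @ V                           # (d-1) x d coordinates of the vertices
    last_row = np.full((1, d), sigma / np.sqrt(d))
    return np.vstack([beta, last_row])

d, sigma = 3, 0.1
M = simplex_counterexample(d, sigma)
singular_values = np.linalg.svd(M, compute_uv=False)

# Adversarial perturbation: remove the last coordinate of every point.
M_flat = M.copy()
M_flat[-1, :] = 0.0
perturbation_sizes = np.linalg.norm(M_flat - M, axis=0)
rank_after = np.linalg.matrix_rank(M_flat)
```

For d = 3 the singular values come out as $\sqrt{3/2}$ (twice) and $\sigma$, each point moves by exactly $\sigma/\sqrt{3}$, and the perturbed points span only a 2-dimensional subspace.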


4 SIMULATIONS 

In this section we report simulation results of our pro¬ 
posed algorithms on the example constructed by Nasi- 
hatkon and Hartley in [15]. It was shown in [15] that 
such an example will result in highly disconnected sim¬ 
ilarity graphs, and thus poses a unique challenge for 
spectral clustering to recover the true clustering of 
data points. In particular, consider 4-dimensional sub¬ 
spaces and for each subspace we generate data set A 
consisting of 8m data points in as follows: 


m—1 

u u {(cos Ok , sin Ok , s J, 5 'J), 

k=0 s,s'G{±1} 

(sJ, s'J, COS ^/c, sin^/c)}; 0k = k7T/m^ (4-1) 


Table 4: Relative Violation (Rel. Vio.) of SEP, clus¬ 
tering accuracy without post-processing (Ace. 1) and 
clustering accuracy with post-processing (Ace. 2) for 
Lasso SSC on noiseless and noisy data. 



Rel. Vio. 

Ace. 1 

Ace. 2 

Noiseless 

.03 

.73 

.99 

Noisy 

.09 

.77 

.93 


where $m \in \mathbb{N}^*$ and $\delta \in (0,1)$ are parameters for generating 
the data set. Finally, the unnormalized observation matrix $\tilde{X}$ is 
constructed as 

$$\tilde{X} = [W_1 A, W_2 A],$$ 

where $W_1, W_2 \in \mathbb{R}^{n \times 4}$ are different linear operators that 
map a 4-dimensional vector to an $n$-dimensional ambient space. The input matrix 
$X$ is then obtained by normalizing $\tilde{X}$ so that each column has unit 
$\ell_2$ norm and adding Gaussian white noise with entry-wise variance $\sigma^2/n$. 

Before presenting the simulation results we first make some remarks on the 
constructed dataset $X$. By construction, $X$ has two overlapping 4-dimensional 
subspaces with probability 1, if both $W_1$ and $W_2$ are sampled uniformly from 
all orthogonal linear mappings from $\mathbb{R}^4$ to $\mathbb{R}^n$. Furthermore, 
noiseless data points in each cluster are in general position, provided that $m$ 
is a prime number. In [?] it was shown that SSC tends to cluster data points in 
each cluster into two disjoint clusters. Hence, the follow-up spectral clustering 
step cannot correctly merge the four learnt clusters into two without additional 
information. 

Figure 2: Clustering analysis on noiseless (top) and noisy (bottom) data. Left: similarity matrix produced by 
Lasso SSC. Middle: spectral clustering on the similarity matrix, with 2 clusters. Right: spectral clustering on 
the similarity matrix, with 4 clusters. 

In Figure 2 we plot the similarity graph learnt by Lasso SSC as well as spectral 
clustering results on both noiseless and noisy data. The parameters for data 
generation are set as $n = 5$, $m = 11$, $\delta = 0.2$, $\sigma = 0.1$, and the 
Lasso SSC parameter is set as $\lambda = 10^{-[\ldots]}$. Figure 2 shows that the 
similarity graph is poorly connected, and hence if we try to directly cluster the 
data points into two clusters (the middle column of the plots), the spectral 
clustering algorithm fails completely. On the other hand, it does a good job in 
clustering the data points into 4 clusters. Subsequently, we could apply our 
proposed post-processing step by first computing the underlying low-dimensional 
subspace for each cluster and then merging those subspaces that are close in 
angular distance. As a result, near perfect clustering could be achieved on this 
synthetic dataset, as shown in the table above. We also report the relative 
violation of the SEP property¹ in the same table to show that the SEP property 
is very well satisfied and is hence not a contributing factor for the poor 
performance of vanilla Lasso SSC. 
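
To make the reported quantity concrete, the relative violation of SEP can be computed as the ratio between the absolute similarity mass placed on pairs from different clusters and the mass placed on pairs from the same cluster. The following is a minimal sketch under that reading of the definition; the function name `relative_violation` and the toy matrix are illustrative assumptions, not taken from the experiments.

```python
def relative_violation(C, labels):
    """Ratio of cross-cluster to within-cluster absolute similarity mass.

    C is a square similarity (coefficient) matrix given as nested lists;
    labels[i] is the ground-truth cluster of point i. Hypothetical helper.
    """
    n = len(C)
    cross = 0.0   # mass on pairs (i, j) from different clusters
    within = 0.0  # mass on pairs (i, j) from the same cluster
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # self-representation is excluded (c_ii = 0)
            weight = abs(C[i][j])
            if labels[i] == labels[j]:
                within += weight
            else:
                cross += weight
    if within == 0.0:
        raise ValueError("no within-cluster similarity mass")
    return cross / within


# Toy example: two clusters of two points with a small amount of leakage.
C = [[0.0, 0.9, 0.1, 0.0],
     [0.9, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 1.0],
     [0.1, 0.0, 1.0, 0.0]]
labels = [0, 0, 1, 1]
rv = relative_violation(C, labels)
```

A value of zero corresponds to a similarity graph with no edge between different clusters, which is exactly the SEP property.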

¹ The relative violation of SEP for a similarity graph $C$ is defined as 
$\sum_{(i,j) \notin M} |C_{ij}| \,/\, \sum_{(i,j) \in M} |C_{ij}|$, where $(i,j) \in M$ 
if and only if $x_i$ and $x_j$ belong to the same cluster. 

5 CONCLUSION 

In this paper we investigate graph connectivity in noisy 
sparse subspace clustering. We propose a robust post-processing 
step for noisy SSC that produces consistent 
clustering with high probability, assuming the magnitude 
of noise is sufficiently small. Our work is the 
first step toward noisy SSC with complete clustering 
guarantees, under the most general fully deterministic 
data model. We next remark on several future directions 
along this line of research, which could further 
improve the results presented in this paper. 


Perhaps the most important limitation of Theorem 3.2 
is the restricted eigenvalue assumption (Assumption 
3.1). Since it concerns the smallest singular value of 
the most ill-posed subset of $d$ data points, we are really 
requiring the noise magnitude $\xi$ to be extremely 
small. In fact, we believe $\sigma_\ell$ is exponentially small with 
respect to the number of data points per subspace, assuming 
they are drawn uniformly from the unit low-dimensional 
sphere. Although getting a better dependency 
on $\sigma_\ell$ is impossible under the adversarial noise 
model (as shown in Sec. 3.3), we conjecture that the 
assumption could be relaxed when the noise is stochastic, 
such as Gaussian white noise. 


Another potentially fruitful direction is to relax the requirement 
that the support of sparse regression for every 
data point consists of at least $d$ other data points. 
With fewer than $(d + 1)$ data points in a connected 
component we can no longer approximately estimate 
the intrinsic low-dimensional subspace; however, we 
might still be able to obtain some leading directions 
of the underlying subspace, which could provide valuable 
information for the subspace merging step. In 
fact, Soltanolkotabi et al. proved lower bounds on 
support size in robust subspace clustering under the 
semi-random model setting [?]. Though their bound 
is not as tight as $\Omega(d)$, it may benefit from some additional 
post-processing step that attempts to merge 
over-clustered subspaces together. 

Appendix A PROOFS OF THEOREMS FOR NOISELESS SSC 


We prove Theorem 3.1, the main theorem for the noiseless clustering-consistent SSC algorithm given in Sec. 3.1. 


Proof of Theorem 3.1. Fix a connected component $G_r = (V_r, E_r) \subseteq G$. By the self-expressiveness property we 
know that all data points in $V_r$ lie on the same underlying subspace $S^{(\ell)}$. It can be easily shown that if $X^{(\ell)}$ is in 
general position then $|V_r| \ge d_\ell + 1$, because for any $x_i \in X^{(\ell)}$ at least $d_\ell$ other data points in the same subspace 
are required to perfectly reconstruct $x_i$. Consequently, we have $\hat{S}_{(r)} = S^{(\ell)}$ because $V_r$ contains at least $d_\ell$ data 
points in $S^{(\ell)}$ that are linearly independent. On the other hand, due to the self-expressiveness property, for every 
$\ell = 1, \cdots, L$ there exists a connected component $G_r$ such that $\hat{S}_{(r)} = S^{(\ell)}$, because otherwise nodes in $X^{(\ell)}$ will 
have no edges attached, which contradicts Eq. (3.1) and the definition of $G$. As a result, the above argument 
shows that Algorithm 1 achieves perfect subspace recovery; that is, there exists a permutation $\pi$ on $[L]$ such that 
$\hat{S}_{(\ell)} = S^{(\pi(\ell))}$ for all $\ell = 1, \cdots, L$. 


We next prove that Algorithm 1 achieves perfect clustering as well, that is, $\pi(\hat{z}_i) = z_i$ for every $i = 1, \cdots, N$. 
Assume by way of contradiction that there exists $i$ such that $\hat{z}_i = \ell$ and $z_i = \ell' \ne \pi(\ell)$. Let $G_r = (V_r, E_r) \subseteq G$ 
be the connected component in $G$ that contains the node corresponding to $x_i$. Since $\hat{z}_i = \ell$, by SEP and the 
above analysis we have $\hat{S}_{(r)} = S^{(\pi(\ell))}$. On the other hand, because $z_i = \ell'$ and data points in $V_r$ are in 
general position, we have $\hat{S}_{(r)} = S^{(\ell')}$. Hence, $S^{(\ell')} = S^{(\pi(\ell))}$ with $\ell' \ne \pi(\ell)$, which contradicts the assumption 
that no two underlying subspaces are identical. □ 
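
The proof above works directly with the connected components $G_r$ of the similarity graph. As an illustration, these components can be extracted with a union-find pass over the coefficient matrix. This is a minimal sketch, assuming an edge between $i$ and $j$ whenever either coefficient exceeds a small tolerance in absolute value; the function name and the tolerance `tol` are assumptions made for the example.

```python
def connected_components(C, tol=1e-8):
    """Label the connected components of the graph induced by C.

    Nodes i and j are joined when |C[i][j]| > tol or |C[j][i]| > tol.
    Returns a list of component labels, numbered in order of appearance.
    """
    n = len(C)
    parent = list(range(n))

    def find(a):
        # Follow parent pointers to the root, halving the path on the way.
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for i in range(n):
        for j in range(n):
            if i != j and abs(C[i][j]) > tol:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    roots = {}
    labels = []
    for i in range(n):
        labels.append(roots.setdefault(find(i), len(roots)))
    return labels


# Toy graph with edges 0-1, 1-2 and 3-4.
C = [[0.0, 0.5, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.7, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 0.3],
     [0.0, 0.0, 0.0, 0.0, 0.0]]
components = connected_components(C)
```

Each resulting component plays the role of one vertex set $V_r$, from which a subspace estimate is then computed.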


Appendix B DISCUSSION ON IDENTIFIABILITY AND $\ell_0$ FORMULATION 
OF NOISELESS SUBSPACE CLUSTERING 

B.1 The identifiability of noiseless subspace clustering 

If we use a more relaxed notion of identifiability, even the “general position” assumption could be dropped for 
consistent clustering. In Theorem B.1 we define such a relaxed notion of identifiability for the union-of-subspace 
structure. 

Theorem B.1. Any set of $N$ data points in $\mathbb{R}^n$ has a partition that follows a union-of-subspace structure, where 
points in each subspace are in general position. We call this partition the minimal union-of-subspace structure. 

Proof. Given a finite set $\mathcal{X} \subset \mathbb{R}^n$, we will algorithmically construct a minimal partition. Initialize the set $\mathcal{Y} = \mathcal{X}$. 
Start with $k = 1$ and do the following repeatedly until it fails, then increment $k$, until $\mathcal{Y} = \emptyset$: find the maximum 
number of points that lie in a hyperplane of dimension $(k + 1)$, assign a new partition for these points and remove 
these points from $\mathcal{Y}$. It is clear that in this way, every partition is a distinct subspace and points in any subspace 
are in general position. □ 

One consequence of Theorem B.1 is that if SEP holds with respect to any minimal union-of-subspace structure 
(i.e., a minimal ground truth), then Algorithm 1 will recover the correct ground truth clustering. We remark that 
SEP does not hold for arbitrary finite subsets of points in $\mathbb{R}^n$ if $\ell_1$ regularization is used, unless the data satisfy certain 
separation conditions [?]. However, in Section B.2 we propose an $\ell_0$ regularization problem which achieves SEP 
(and hence consistent clustering) for any $\mathcal{X} \subset \mathbb{R}^n$. 

We note that the minimal union-of-subspace structure may not be unique. For example, if there is one point 
in the intersection of two subspaces with equal dimension, then this point can be assigned to either subspace. 
Now, suppose the intersection has dimension $k$; then there can be at most $k$ points in the intersection, otherwise these 
points will form a new $k$-dimensional subspace and the original structure is no longer minimal. 

B.2 The merit of $\ell_0$-minimization and agnostic subspace clustering 

A byproduct of our result is that it also addresses an interesting question of whether it is advantageous to use 
$\ell_0$ over $\ell_1$ minimization in subspace clustering, namely 

$$\min_{c_i} \|c_i\|_0, \quad \text{s.t. } x_i = X c_i, \; c_{ii} = 0. \tag{B.1}$$ 

If one poses this question to a compressive sensing researcher, the answer will most likely be yes, since $\ell_0$ 
minimization is the original problem of interest and empirical evidence suggests that using an iterative re-weighted 
$\ell_1$ scheme to approximate $\ell_0$ solutions often improves the quality of signal recovery. On the other hand, a 
statistician is most likely to answer just the opposite because $\ell_1$ shrinkage would often significantly reduce 
the variance at the cost of a small amount of bias. A formal treatment of the latter intuition suggests that $\ell_1$ 
regularized regression has strictly less “effective-degree-of-freedom” than the “best-subset selection” [22], and 
therefore generalizes better. 

How about subspace clustering? Unlike the $\ell_1$ solution, which is unique almost everywhere, $\ell_0$ solutions will not be 
unique and it is easy to construct a largely disconnected graph based on optimal $\ell_0$ solutions. Using the new 
observation that we do not actually need graph connectivity, we are able to establish that $\ell_0$ minimization for 
SSC is indeed the ultimate answer for noiseless subspace clustering. 

Theorem B.2. Given any $N$ points in $\mathbb{R}^n$, any solutions to the $\ell_0$-variant of Algorithm 1 will partition the 
points into a minimal union-of-subspace structure. 


Proof. Define a minimal subspace with respect to point $x_i$ in a set $\{x_i\}_{i=1}^N$ to be the span of any points that 
minimize (B.1) for $i$. Since the ordering of how data points are used does not matter in Algorithm 1, we can 
sort the points into an ascending order with respect to the dimensionality. Now the merging procedure of these 
subspaces into a unique set of subspaces is exactly the same as the construction in the proof of Theorem B.1. 
Therefore, all solutions of the $\ell_0$ SSC are going to be the correct partition. □ 
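
For very small instances, problem (B.1) can be solved exactly by enumerating candidate supports in order of increasing size. The sketch below is only meant to illustrate the objective; it runs in exponential time, it assumes NumPy is available, and the function name `l0_self_representation` and the tolerance are choices made for the example rather than part of the algorithm analyzed here.

```python
from itertools import combinations

import numpy as np


def l0_self_representation(X, i, tol=1e-9):
    """Brute-force solver for min ||c||_0 s.t. x_i = X c, c_i = 0.

    X holds one data point per column. Returns a coefficient vector with the
    smallest possible support, or None if x_i is not in the span of the rest.
    """
    num_points = X.shape[1]
    others = [j for j in range(num_points) if j != i]
    for size in range(1, len(others) + 1):
        for support in combinations(others, size):
            basis = X[:, list(support)]
            coef, *_ = np.linalg.lstsq(basis, X[:, i], rcond=None)
            # Accept the support if it reconstructs x_i up to the tolerance.
            if np.linalg.norm(basis @ coef - X[:, i]) <= tol:
                c = np.zeros(num_points)
                c[list(support)] = coef
                return c
    return None


# Two points on a line and three points in a plane of R^3.
X = np.array([[1.0, 2.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 0.0, 1.0],
              [0.0, 0.0, 0.0, 1.0, 1.0]])
c0 = l0_self_representation(X, 0)  # supported on the other point of the line
c4 = l0_self_representation(X, 4)  # supported on the two other plane points
```

In this toy example every optimal support stays inside the subspace of the point being represented, which is the behavior Theorem B.2 relies on.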


With slightly more effort, it can be shown that the converse is also true. Therefore, the set of solutions of 
$\ell_0$-SSC completely characterizes the set of minimal union-of-subspace structures for any set of points in $\mathbb{R}^n$. In 
contrast, $\ell_1$-SSC requires an additional separation condition to work. That said, it may well be the case in practice 
that $\ell_1$-SSC works better for noisy subspace clustering in the low signal-to-noise ratio regime. It will be 
an interesting direction to explore how iterative reweighted $\ell_1$ minimizations and local optimization for the $\ell_p$-norm 
($0 < p < 1$) work in subspace clustering applications. 


Appendix C PROOFS OF THEOREMS FOR NOISY SSC 

The purpose of this section is to present a complete proof of Theorem 3.2, our main result concerning clustering 
consistent Lasso SSC on noisy data. We first present and prove two technical propositions that will be used 
later. 

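Proposition C.1. Let $u$ be an arbitrary vector in $S^{(\ell)}$ with $\|u\|_2 = 1$. Then 
$\max_{1 \le i' \le N_\ell,\, i' \ne i} |\langle u, x_{i'}^{(\ell)} \rangle| \ge \rho_\ell$ for every $i = 1, \cdots, N_\ell$. 


Proof. For notational simplicity let $X_{-i}^{(\ell)} = (x_1^{(\ell)}, \cdots, x_{i-1}^{(\ell)}, x_{i+1}^{(\ell)}, \cdots, x_{N_\ell}^{(\ell)})$ and 
$Q_{-i}^{(\ell)} = \mathrm{conv}(\pm X_{-i}^{(\ell)})$. The objective of Proposition C.1 is to lower bound $\|X_{-i}^{(\ell)\top} u\|_\infty$ 
for any $u \in S^{(\ell)}$ with $\|u\|_2 = 1$. By definition of the dual norm, $\|X_{-i}^{(\ell)\top} u\|_\infty$ is equal to the objective 
of the following optimization problem 

$$\max_{c} \; \langle u, X_{-i}^{(\ell)} c \rangle \quad \text{s.t. } \|c\|_1 = 1. \tag{C.1}$$ 

To obtain a lower bound on the objective of Eq. (C.1), note that $\rho_\ell$ is the radius of the largest ball inscribed in 
$Q_{-i}^{(\ell)}$ and hence $\rho_\ell u \in Q_{-i}^{(\ell)}$. Consequently, $\rho_\ell u$ can be written as a convex combination of (signed) columns 
in $X_{-i}^{(\ell)}$; that is, there exists $c \in \mathbb{R}^{N_\ell - 1}$ with $\|c\|_1 = 1$ such that $X_{-i}^{(\ell)} c = \rho_\ell u$. Plugging the expression into 
Eq. (C.1) we obtain 

$$\|X_{-i}^{(\ell)\top} u\|_\infty \ge \langle u, \rho_\ell u \rangle = \rho_\ell.$$ 

□ 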


Proposition C.2. Let $A = (a_1, \cdots, a_m)$ be an arbitrary matrix with at least $m$ rows. Then 
$\|a_i - P_{\mathrm{Range}(a_{-i})}(a_i)\|_2 \ge \sigma_m(A)$, where $a_{-i}$ denotes all columns in $A$ except $a_i$. 

Proof. Denote $a_i^\perp = a_i - P_{\mathrm{Range}(a_{-i})}(a_i)$. By definition, $a_i^\perp \in \mathrm{Range}(A)$ and $\langle a_i^\perp, a_{i'} \rangle = 0$ for all $i' \ne i$. 
Consequently, 

$$\sigma_m(A) \le \inf_{u \in \mathrm{Range}(A)} \frac{\|A^\top u\|_2}{\|u\|_2} \le \frac{\|A^\top a_i^\perp\|_2}{\|a_i^\perp\|_2} = \frac{\|a_i^\perp\|_2^2}{\|a_i^\perp\|_2} = \|a_i^\perp\|_2.$$ 

□ 
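
Proposition C.2 is easy to check numerically. The following sketch (assuming NumPy; the helper name `distance_to_other_columns` and the random test matrix are illustrative) computes, for each column of a random matrix, its distance to the span of the remaining columns and compares it with the smallest singular value.

```python
import numpy as np


def distance_to_other_columns(A, i):
    """l2 distance from column a_i to the range of the remaining columns."""
    others = np.delete(A, i, axis=1)
    # Least squares gives the projection of a_i onto Range(a_{-i}).
    coef, *_ = np.linalg.lstsq(others, A[:, i], rcond=None)
    return float(np.linalg.norm(A[:, i] - others @ coef))


rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))          # 6 rows, m = 4 columns
sigma_min = float(np.linalg.svd(A, compute_uv=False)[-1])
distances = [distance_to_other_columns(A, i) for i in range(A.shape[1])]
holds = all(dist >= sigma_min - 1e-10 for dist in distances)
```

Here `holds` records whether every column satisfies the bound of the proposition up to floating-point tolerance.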


We next present two key lemmas. The first lemma, Lemma C.1, shows that the estimated subspace $\hat{S}$ from noisy 
inputs is a good approximation to the underlying subspace $S^{(\ell)}$ as long as the restricted eigenvalue assumption 
holds and exactly $d$ points from the same subspace are used to construct $\hat{S}$. 

Lemma C.1. Fix $\ell \in \{1, \cdots, L\}$. Suppose $\hat{S}$ is the range of a subset of points $Y_d \subseteq Y^{(\ell)}$ containing exactly $d$ 
noisy data points belonging to the $\ell$th subspace. Let $S^{(\ell)}$ be the ground-truth subspace; i.e., $x_i^{(\ell)} \in S^{(\ell)}$. 
Under Assumption 3.1 we have 

$$d(\hat{S}, S^{(\ell)}) \le [\ldots] \tag{C.2}$$ 

Proof. Suppose $Y_d = (y_1^{(\ell)}, \cdots, y_d^{(\ell)})$ and $X_d = (x_1^{(\ell)}, \cdots, x_d^{(\ell)})$. By the noise model, $\|Y_d - X_d\|_F^2 \le d\xi^2$. 
On the other hand, by Assumption 3.1 we have $\sigma_d(X_d) \ge \sigma_\ell$. Wedin’s theorem (Lemma D.1 in 
Appendix D) then yields the lemma. □ 

In Lemma C.2 we show that if the restricted eigenvalue assumption holds and the regularization parameter $\lambda$ 
is in a certain range, the optimal solution to the Lasso problem in Eq. (3.2) has at least $d$ nonzero coefficients, 
which leads to $|V_r| \ge d + 1$ for every connected component $V_r$ in the similarity graph constructed by the noisy SSC algorithm. 
Lemma C.2 is a natural extension of the fact that at least $d$ points should be used to reconstruct a certain data 
point for noiseless inputs, if the data matrix $X$ is in general position. 

Lemma C.2. Assume Assumption 3.1 and the self-expressiveness property hold. For each $i \in \{1, \cdots, N\}$, 
$\|c_i\|_0 \ge d$ if the regularization parameter $\lambda$ satisfies 

$$2\xi(1 + \xi)^2(1 + 1/\rho_\ell) < \lambda < \frac{\rho_\ell \sigma_\ell}{2}, \quad \ell = 1, \cdots, L. \tag{C.3}$$ 


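Proof. Because the self-expressiveness property holds, we assume without loss of generality that the support 
set of $c_i$ with $\|c_i\|_0 = t$ is $\{y_1^{(\ell)}, \cdots, y_t^{(\ell)}\}$. Assume by way of contradiction that $\|c_i\|_0 < d$ and define 
$y_i^\perp = y_i^{(\ell)} - \sum_{j=1}^{d-1} c_{i,j} y_j^{(\ell)}$, where $c_{i,1}, \cdots, c_{i,d-1}$ contain all nonzero coefficients² in $c_i$. Since $c_i$ is optimal, the following 
must hold for every $y_{i'}^{(\ell)}$ with $i' \ne i$: 

$$\mathrm{argmin}_{c \in \mathbb{R}} \; \|y_i^\perp - c\, y_{i'}^{(\ell)}\|_2^2 + 2\lambda |c| = 0. \tag{C.4}$$ 

To see the necessity of Eq. (C.4), note that an optimal solution $c^* \ne 0$ to Eq. (C.4) implies 

$$\|y_i^{(\ell)} - Y_{-i}\tilde{c}_i\|_2^2 + 2\lambda\|\tilde{c}_i\|_1 \le \|y_i^\perp - c^* y_{i'}^{(\ell)}\|_2^2 + 2\lambda|c^*| + 2\lambda\|c_i\|_1 < \|y_i^\perp\|_2^2 + 2\lambda\|c_i\|_1 = \|y_i^{(\ell)} - Y_{-i}c_i\|_2^2 + 2\lambda\|c_i\|_1,$$ 

where $\tilde{c}_i = c_i + c^* \cdot e_{i'}$. This contradicts the optimality of $c_i$ with respect to Eq. (3.2). 

By optimality conditions, Eq. (C.4) implies $|\langle y_i^\perp, y_{i'}^{(\ell)} \rangle| \le \lambda$. In the remainder of the proof we will show that 
under the assumptions made in Lemma C.2, $|\langle y_i^\perp, y_{i'}^{(\ell)} \rangle| > \lambda$, which results in a contradiction. 

In order to lower bound $|\langle y_i^\perp, y_{i'}^{(\ell)} \rangle|$ we first bound the noiseless version of the inner product $|\langle x_i^\perp, x_{i'}^{(\ell)} \rangle|$, where 
$x_i^\perp = x_i^{(\ell)} - \sum_{j=1}^{d-1} c_{i,j} x_j^{(\ell)}$. A key observation is that $x_i^\perp \in S^{(\ell)}$ and hence by Propositions C.1 and C.2 the 
following chain of inequalities holds for any $x_{i'}^{(\ell)}$ with $i' \ne i$: 

$$|\langle x_i^\perp, x_{i'}^{(\ell)} \rangle| \ge \rho_\ell \|x_i^\perp\|_2 \ge \rho_\ell \left\| x_i^{(\ell)} - P_{\mathrm{Range}(x_1^{(\ell)}, \cdots, x_{d-1}^{(\ell)})}(x_i^{(\ell)}) \right\|_2 \ge \rho_\ell \sigma_\ell. \tag{C.5}$$ 

² Some coefficients in $c_{i,1}, \cdots, c_{i,d-1}$ might be zero because $\|c_i\|_0$ could be smaller than $d - 1$. 

Our next objective is to upper bound the inner product perturbation $|\langle y_i^\perp, y_{i'}^{(\ell)} \rangle - \langle x_i^\perp, x_{i'}^{(\ell)} \rangle|$ and subsequently 
obtain a lower bound on $|\langle y_i^\perp, y_{i'}^{(\ell)} \rangle|$. Note that 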

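$$\langle y_i^\perp, y_{i'}^{(\ell)} \rangle = \langle x_i^\perp, x_{i'}^{(\ell)} \rangle + \langle y_i^\perp - x_i^\perp, x_{i'}^{(\ell)} \rangle + \langle x_i^\perp, y_{i'}^{(\ell)} - x_{i'}^{(\ell)} \rangle + \langle y_i^\perp - x_i^\perp, y_{i'}^{(\ell)} - x_{i'}^{(\ell)} \rangle;$$ 

therefore, 

$$|\langle y_i^\perp, y_{i'}^{(\ell)} \rangle - \langle x_i^\perp, x_{i'}^{(\ell)} \rangle| \le \|y_i^\perp - x_i^\perp\|_2 \|x_{i'}^{(\ell)}\|_2 + \|y_i^\perp\|_2 \|y_{i'}^{(\ell)} - x_{i'}^{(\ell)}\|_2 \le \|y_i^\perp - x_i^\perp\|_2 + \xi \|y_i^\perp\|_2. \tag{C.6}$$ 

In order to upper bound $\|y_i^\perp\|_2$ and $\|y_i^\perp - x_i^\perp\|_2$, note that by definition 
$\|y_i^\perp\|_2 = \|y_i^{(\ell)} - \sum_{j=1}^{d-1} c_{i,j} y_j^{(\ell)}\|_2 \le (1 + \|c_i\|_1)(1 + \xi)$ and 
$\|y_i^\perp - x_i^\perp\|_2 = \|(y_i^{(\ell)} - x_i^{(\ell)}) - \sum_{j=1}^{d-1} c_{i,j} (y_j^{(\ell)} - x_j^{(\ell)})\|_2 \le \xi(1 + \|c_i\|_1)$. Hence we only need to upper bound 
$\|c_i\|_1$, which can be done by the following argument due to the optimality of $c_i$. By arguments on page 21 in 
[24], the following upper bound on $\|c_i\|_1$ is proven: 

$$\|c_i\|_1 \le \frac{1}{\rho_\ell} + \frac{\xi^2}{\lambda}\left(1 + \frac{1}{\rho_\ell}\right)^2. \tag{C.7}$$ 

The lower bound on $\lambda$ in Eq. (C.3) implies that $\xi < \lambda/(1 + 1/\rho_\ell)$. Plugging this upper bound into Eq. (C.7) we 
obtain 

$$\|c_i\|_1 \le 1/\rho_\ell + \xi(1 + 1/\rho_\ell) \le (1 + \xi)(1 + 1/\rho_\ell), \tag{C.8}$$ 

which eliminates the dependency on $\lambda$. Since Eq. (C.8) also gives $1 + \|c_i\|_1 \le (1 + \xi)(1 + 1/\rho_\ell)$, substituting into the upper 
bounds for $\|y_i^\perp\|_2$ and $\|y_i^\perp - x_i^\perp\|_2$ we get 

$$\|y_i^\perp\|_2 \le (1 + \xi)^2(1 + 1/\rho_\ell); \qquad \|y_i^\perp - x_i^\perp\|_2 \le \xi(1 + \xi)(1 + 1/\rho_\ell). \tag{C.9}$$ 

Combining Eq. (C.5), (C.6) and (C.9) we obtain the following lower bound on $|\langle y_i^\perp, y_{i'}^{(\ell)} \rangle|$: 

$$|\langle y_i^\perp, y_{i'}^{(\ell)} \rangle| \ge \rho_\ell \sigma_\ell - 2\xi(1 + \xi)^2(1 + 1/\rho_\ell) > \tfrac{1}{2}\rho_\ell \sigma_\ell, \tag{C.10}$$ 

where the last inequality is due to the assumption that $2\xi(1 + \xi)^2(1 + 1/\rho_\ell) < \tfrac{1}{2}\rho_\ell \sigma_\ell$ implied by Eq. (C.3). Finally, 
since $\tfrac{1}{2}\rho_\ell \sigma_\ell > \lambda$ as assumed in Eq. (C.3), we have $|\langle y_i^\perp, y_{i'}^{(\ell)} \rangle| > \lambda$, which results in the desired contradiction. □ 

Finally, Theorem 3.2 is a simple consequence of Lemmas C.1 and C.2, because under the conditions of Lemma 
C.2 every component $V_r$ will have at least $d$ data points. Define [...]. Lemma C.1 implies that 
$d(\hat{S}_{(r)}, \hat{S}_{(r')}) \le$ [...] if $V_r$ and $V_{r'}$ belong to the same cluster. On the other hand, by the separation condition in 
Eq. (3.6) and Lemma C.1, if $V_r$ and $V_{r'}$ belong to different clusters we would have $d(\hat{S}_{(r)}, \hat{S}_{(r')}) >$ [...]. Therefore, 
the single-linkage clustering procedure in the noisy SSC algorithm will eventually merge estimated subspaces correctly. 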
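
The merging argument above can be illustrated with a short sketch: estimate a $d$-dimensional subspace from each connected component, compute pairwise subspace distances, and join components whose distance falls below a threshold. Cutting a single-linkage dendrogram at a fixed height is equivalent to taking connected components of the thresholded distance graph, which is what the code does. The distance used here (the Frobenius norm of the sines of the canonical angles), the threshold value, and all function names are assumptions made for the example; NumPy is assumed to be available.

```python
import numpy as np


def estimate_subspace(Y, d):
    """Orthonormal basis of the top-d left singular subspace of Y."""
    U, _, _ = np.linalg.svd(Y, full_matrices=False)
    return U[:, :d]


def subspace_distance(U1, U2):
    """Frobenius norm of the sines of the canonical angles (assumed metric)."""
    cosines = np.clip(np.linalg.svd(U1.T @ U2, compute_uv=False), 0.0, 1.0)
    return float(np.sqrt(np.sum(1.0 - cosines ** 2)))


def merge_components(bases, threshold):
    """Single linkage cut at `threshold`, implemented with union-find."""
    k = len(bases)
    parent = list(range(k))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]
            a = parent[a]
        return a

    for r in range(k):
        for s in range(r + 1, k):
            if subspace_distance(bases[r], bases[s]) <= threshold:
                parent[find(s)] = find(r)

    roots = {}
    return [roots.setdefault(find(r), len(roots)) for r in range(k)]


# Four over-segmented components: two near span(e1, e2), two near span(e3, e4).
components = [
    np.array([[1, 0, 0, 0, 1e-3], [0, 1, 0, 0, -1e-3], [0.6, 0.8, 0, 0, 0]]).T,
    np.array([[0.8, 0.6, 1e-3, 0, 0], [0.6, -0.8, 0, 1e-3, 0]]).T,
    np.array([[0, 0, 1, 0, 1e-3], [0, 0, 0, 1, 0]]).T,
    np.array([[0, 1e-3, 0.6, 0.8, 0], [0, 0, 0.8, -0.6, 1e-3], [1e-3, 0, 1, 0, 0]]).T,
]
bases = [estimate_subspace(Y, d=2) for Y in components]
merged = merge_components(bases, threshold=0.1)
```

In this toy example the first two components and the last two components receive the same merged label, mirroring the over-clustered situation observed in the experiments.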


Appendix D MATRIX PERTURBATION THEOREMS 


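Lemma D.1 (Wedin’s theorem; Theorem 4.1, pp. 260 in [?]). Let $A, E \in \mathbb{R}^{m \times n}$ be given matrices with $m \ge n$. 
Let $A$ have the following singular value decomposition 

$$\begin{bmatrix} U_1^\top \\ U_2^\top \\ U_3^\top \end{bmatrix} A \begin{bmatrix} V_1 & V_2 \end{bmatrix} = \begin{bmatrix} \Sigma_1 & 0 \\ 0 & \Sigma_2 \\ 0 & 0 \end{bmatrix},$$ 

where $U_1, U_2, U_3, V_1, V_2$ have orthonormal columns and $\Sigma_1$ and $\Sigma_2$ are diagonal matrices. Let $\tilde{A} = A + E$ be 
a perturbed version of $A$ and $(\tilde{U}_1, \tilde{U}_2, \tilde{U}_3, \tilde{V}_1, \tilde{V}_2, \tilde{\Sigma}_1, \tilde{\Sigma}_2)$ be the analogous singular value decomposition of $\tilde{A}$. Let 
$\Phi$ be the matrix of canonical angles between $\mathrm{Range}(U_1)$ and $\mathrm{Range}(\tilde{U}_1)$ and $\Theta$ be the matrix of canonical angles 
between $\mathrm{Range}(V_1)$ and $\mathrm{Range}(\tilde{V}_1)$. If there exists $\delta > 0$ such that 

$$\min_{i,j} \left| [\tilde{\Sigma}_1]_{i,i} - [\Sigma_2]_{j,j} \right| \ge \delta \quad \text{and} \quad \min_{i} \left| [\tilde{\Sigma}_1]_{i,i} \right| \ge \delta,$$ 

then 

$$\|\sin \Phi\|_F^2 + \|\sin \Theta\|_F^2 \le \frac{2\|E\|_F^2}{\delta^2}.$$ 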
